Bioinformatician: Jatin Arora
Student ID: 2502495
Biomedical Microbiology Laboratory
Section A
| STUDENT NUMBER | 2502495 |
|---|---|
| Programme | MSc Bioinformatics |
| Unit Code and Name | BIOLMM37_2023_TB-1 Group Project |
| Project | 2 |
| Assessmment Name | Group Project Alternative Assessment |
| Word Count | Well under 1500 words |
| Do you give permission for you work to be used anonymously in examples given to students in the future? | Yes |
| Title of the Report | PROJECT 2: PATHOGENIC ISLAND DETECTION IN A NEW STRAIN OF YERSINIA |
| Essential UoB Literature Referenced | Workbooks on Accelerated R Couse by Dr Christopher Clements, Statistics Bootcamp, Statistics for Biological Sciences, Year 2 - Quantitative and Computational Methods - 2020 iteration by Multiple Authors and Year 4 - Understanding Data: Experimental Design and Statistics for Life Scientists - 2020 iteration by Multiple Authors on Blackboard (Dr. Christopher Clements, Duncan O’Brien, Duncan Edgley, et al), Genomics Bootcamp by Dr Francesca Segers on Blackboard, Command Line Working Document, and Project Description Powerpoint, by Dr Celine Petitjean on Blackboard |
| Unit Director | Dr Celine Petitjean |
| Date of Submission | Monday, January 22, 2024 |
| Place | Life Science Building (LSB), School of Biological Sciences, Faculty of Life Sciences, University of Bristol |
| Marks based on the summmative computer based assessmment commbininng all the leaarning objectives | 100%. |
(Malmfors, 2000)
By submitting this assignment cover sheet, I confirm that I understand and agree with the following statements:
• I have not committed plagiarism, cheated or otherwise committed academic misconduct as defined in the University’s Assessment Regulations (available at https://www.bristol.ac.uk/media-library/sites/academic-quality/documents/taught-code/annexes/university-examination-regulations.pdf)
• I have not submitted this piece, in part or in its entirety, for assessment in another unit assignment (including at other institutions) as outlined in section 4 of the University’s Assessment regulations (available at https://www.bristol.ac.uk/media-library/sites/academic-quality/documents/taught-code/annexes/university-examination-regulations.pdf)
• I understand that this piece will be scrutinised by anti-plagiarism software and that I may incur penalties if I am found to have committed plagiarism, as outlined in sections 3 of the University’s Examination Regulations (available at https://www.bristol.ac.uk/media-library/sites/academic-quality/documents/taught-code/annexes/university-examination-regulations.pdf)
This assignment is undertaken by the abovementioned bioinformatician based on the original publication Research Article in Journal of Bacteriology, American Society for Microbiology-Activity of a Holin-Endolysin System in the Insecticidal Pathogenicity Island of Yersinia enterocolitica (Katharina, S. 2018) Study of a lysis cassette in Tc-PAIY of four ORFs in Yersinia enterocolitica.(Springer et al., 2018)
Table of Contents
Section A Assessmment Cover Sheet............................................................................................ 0
0.0 Assessment..................................................................................................0.0
1.0 Overview ................................................................................. 0.0
2.0 Introduction/Background ....................................................................... 0.0
3.0 Project Planning.............................................................................................0.0
4.0 Literature Review ............................................................................................0.0
5.0 Results...............................................................................0.0
6.0 Further Visualisation and Analysis...................................................................................0.0
7.0 Applications and Insights .................................................................................. 0.0
8.0 Real-world Relevance ....................................................................... 0.0
9.0 Future Research (FOLLOW ON GITHUB FOR MORE) ..................................................................................................0.0
10.0 Conclusion.............................................................................................0.0
References/Citations.............................................................................................0.0
Connor, M. (1992)
# 0.0 ASSESSMENT
# In tending to the evaluation necessities, the culmination of tasks 1, 2, and 3 for the genome NewStrain1 involved a
#methodical methodology. Task 1 required the arrangement of the genomic dataset by guaranteeing the right designing of the
#fasta record, setting the establishment for resulting examinations. Task 2 included executing a BLAST analysis utilizing
#the BLASTp command-line instrument, questioning the arranged dataset against the Yersinia genome database.
#The assessment of arrangement scores, E-values, and percent characters during manual analysis affirmed the presence of the
#four genes of premium from the lysis cassette in Tc-PAIY in NewStrain1. Task 3 was achieved by fostering a Python script
#for succession recovery, empowering the group to separate explicit quality arrangements in view of their ID numbers.
#Pushing ahead, Task 4 could be accomplished by planning a complete pipeline and manual to direct the group through the
#reiteration of investigations on an alternate dataset. This includes data planning, BLAST analysis, manual examination of
#results, and execution of a script for succession recovery, with the manual giving nitty gritty directions to each step.
#In a social environment, correspondence, joint effort, and a mutual perspective of tasks and objectives are principal.
#Powerful designation of obligations, clear documentation, and customary advancement refreshes add to a fruitful and strong
#collective endeavor. The significance of keeping up with open lines of correspondence and regarding different viewpoints
#encourages a cooperative climate, guaranteeing the group's aggregate achievement.
# 1.O OVERVIEW
# The Workflow
# The whole workflow is devised to effectively work as a modular bioinformatic pipeline with independent but interrelated
#parts. Installations, steps, and actions needed to be taken are explained here but it is recommended to consult the NCBI
#BLAST User Manual and the Anaconda Navigator Manual attached (all credits to respective owners)(NCBI BLAST, 2015).
# Complete Anaconda-NavigatorTM User Manual attached. Documentation hosted by Read the Docs©. Click the following link to
#access Read the Docs Anaconda-Navigator Manual ((Anaconda, 2018). Created using Sphinx 5.0.2. Built with the PyData Sphinx
#Theme 0.14.4.
# Also, find attached complete Anaconda-NavigatorTM Tutorial. Documentation hosted by Read the Docs©. Click the following
#link to access Read the Docs Anaconda-Navigator Tutorial ((Anaconda, 2018). Created using Sphinx 5.0.2. Built with the
#PyData Sphinx Theme 0.14.4.
# With working directory attached is an Improv Academic Blueprint-A preliminary literature review integrated as part of
#modular pipeline which defined this pipeline construction, design and build. Please refer to the document for thorough
#referential enrichmment of scientific and academic concepts governing this workflow.
2.0 INTRODUCTION/BACKGROUND
A detailed analysis of BLAST for the Yersinia project was carried out to investigate the genomic features of a newly recognized strain Yersinia. The analysis particularly concentrated on four genes of interest; possibly associated with the phage holin family – a group of proteins involved in bacterial lysis. The main objectives were to determine if these genes characterize the new strain, how their sequences vary, and how they can be compared against sequences from different bacterial strains, thus shedding light on their functional significance and evolutionary implications
![IMAGE] (https://github.com/2502495/Project-2/blob/a3c456a018655b38858e66ff15edd5e74a66a7e3/Picture%201.png)
Figure 1. Pathogenecity Island in Yersinia enterocolitica W22703
Fuchs, T.M., Bresolin, G., Marcinowski, L. et al., BMC Microbiol 8, 214 (2008).
# A detailed analysis of BLAST for the Yersinia project was carried out to investigate the genomic features of a
# newly recognized strain Yersinia. The analysis particularly concentrated on four genes of interest;
# possibly associated with the phage holin family – a group of proteins involved in bacterial lysis.
# The main objectives were to determine if these genes characterize the new strain, how their sequences vary,
# and how they can be compared against sequences from different bacterial strains, thus shedding light on their functional
# significance and evolutionary implications.
# Holin:
# Holin, a transmembrane protein with 1-4 transmembrane sequences, exhibits diverse classifications based on topological,
# phylogenetic, and motif analyses, resulting in the identification of 58 families and 7 superfamilies. Experimental evidence
# from previous cell killing assays indicates that the presence of both the C-terminal region and transmembrane domains is
# essential for the generation of pores in the inner cell membrane.
# Endolysin:
# Endolysins, enzymatic agents pivotal in bacterial cell lysis, possess a structural composition comprising distinct
# elements: catalytic domain, cell wall binding domain, linker regions, and accessory domains. Their primary
# function involves the targeted degradation of the bacterial cell walls peptidoglycan, facilitating the release of phage
# progeny. Notably, the N-terminal cell wall binding domain recognizes specific bacterial components, while the
# C-terminal catalytic domain executes enzymatic activity.
# i-spanin and o-spanin:
# Encoded by the RzRz1 genes in both lambda and φ21 phages, i-spanin and o-spanin represent proteins integral to the
# spanin complex. The proposed model for spanin function elucidates its role in the removal of the outer membrane barrier
# by fusing it to the inner membrane. Interaction between i-spanin and o-spanin occurs through their C-terminal regions,
# wherein conformational changes in the two alpha-helix regions of i-spanin facilitate the fusion of the two membranes,
#ultimately leading to the formation of pores on the bacterial membrane.
#3.0 PROJECT PLANNING
#To smooth out the analysis interaction and empower the group to rehash the examinations on an alternate dataset,
#a pipeline alongside a complete manual can be given. Here is an improved version of pipeline:
#Pipeline:
#1. Step 1: Data Preparation
#• Input: Fasta file of the new Yersinia strain genome.
#• Output: Prepared dataset.
#2. Step 2: BLAST Analysis
#• Input: Prepared dataset and known protein sequences.
#3• Output: BLAST results file.
#3. Step 3: BLAST Results Analysis
#• Input: BLAST results file.
#• Output: Confirmation of gene presence.
#4. Step 4: Script Execution
#• Input: Fasta file, target gene ID.
#• Output: Retrieved gene sequence.
#Manual
#• Detailed instructions for each step.
#• Necessary parameters and commands for tools/scripts.
#• Troubleshooting guidance.
#Explanation of Findings
#• Clarification of the significance of alignment scores, E-values, and percent identities.
#• Confirmation or rejection of the presence of genes based on the analysis.
#• Interpretation of any unexpected results or discrepancies.
The team expects to affirm the presence of the four genes from the lysis cassette in Tc-PAIY inside their new strain of Yersinia, signified as NewStrain1. Following the execution of a BLAST analysis utilizing the BLASTp command-line device, questioning NewStrain1 against the Yersinia genome database, the group cautiously inspects the got results. The analysis centers around the alignment scores, E-values, and percent characters of the matches to the realized protein successions from the lysis cassette. Upon intensive examination, the group sees that the alignment scores are quite high, showing hearty likeness. The E-values are low, recommending a high statistical meaning of the matches, and the percent characters confirm a substantial comparability between the inquiry and database groupings. These discoveries by and large affirm the presence of the four genes of interest in the lysis cassette in Tc-PAIY inside the genomic cosmetics of NewStrain1.
#Import Required Libraries
from Bio import SeqIO
from Bio.Blast import NCBIWWW
import os
from Bio import Entrez
# Set your email here
Entrez.email = "your_email@example.com"
# List of provided protein IDs
ids = [
"CAI77377.1",
"AAT90758.1",
"WP_004875774.1",
"WP_050116509.1",
"AAM83780.1",
"ABX88260.1",
"CAH22792.1",
"CAE13263.1",
"AAG55972.1",
"WP_047345313.1",
"WP_047964150.1",
"AAQ61208.1"
]
# Open a file to write the sequences
# Running the Script
#Place the FASTA File: Ensure the holin_sequences.fasta file is in the same directory as the script.
#Open Command Line: Use Terminal (Mac/Linux) or Command Prompt (Windows).
#Navigate to Script Directory: Change directory to where the script is located using cd path/to/script.
#Run the Script: Execute the script by typing python script_name.py (replace script_name.py with the actual script filename).
with open("holin_sequences.fasta", "w") as file:
for i in ids:
# Fetch each sequence from NCBI and write to the file
handle = Entrez.efetch(db="protein", id=i, rettype="fasta", retmode="text")
file.write(handle.read())
In Analysis to decide the protection of the four genes from the lysis cassette in Tc-PAIY inside the new strain NewStrain1 contrasted with other Yersinia strains, the group would broaden their examination. Following the BLAST analysis, the center movements to surveying the likeness of the recognized genes across a scope of Yersinia strains. The group looks at the alignment scores, E-values, and percent characters of the matched genes in NewStrain1 against those in different Yersinia strains recorded in the database. On the off chance that the alignment scores are reliably high and the E-values are reliably low across various strains, it proposes a serious level of protection. Also, noticing high percent personalities would additionally uphold the thought that the genes are moderated among various Yersinia strains. These discoveries would demonstrate that the distinguished genes from the lysis cassette in Tc-PAIY are available as well as profoundly monitored in NewStrain1 contrasted with other Yersinia strains, suggesting utilitarian importance across various genomic foundations.
3. Be able to retrieve their specific protein sequence from the results (or any other in the future)
#The research group imagines a deliberate pipeline intended to investigate extra genes of premium inside the genomic scene
#of the new Yersinia strain, NewStrain1. The initial step includes setting up the genomic dataset, guaranteeing appropriate
#designing to work with downstream investigations. Consequently, the group utilizes the BLASTp command-line instrument to
#direct an underlying analysis against an exhaustive quality database, producing a rundown of potential genes of interest.
#To refine and focus on this rundown, a separating cycle is carried out, taking into account alignment scores, E-values,
#and #other significant models. The third step coordinates practical explanation tools, like InterProScan, to clarify the
#potential jobs and elements of the recognized up-and-comer genes. The refined rundown is additionally approved through a
#complete writing survey or computerized tools for writing mining, guaranteeing that the chose genes have known capabilities
#or affiliations. As a closing step, a discretionary arrangement for experimental approval is incorporated, considering
#further affirmation of the distinguished genes through research center investigations or joint efforts with experimental
#biologists. This comprehensive pipeline offers an organized and precise methodology for the group to investigate,
#distinguish, and examine extra genes of premium in NewStrain1's genomic setting.
#Function to Retrieve Sequences from a FASTA File
def retrieve_sequence(fasta_file, gene_id):
for record in SeqIO.parse(fasta_file, "fasta"):
if record.id == gene_id:
return record.seq
return None
A. #Function to Perform BLAST Analysis
def blast_sequence(sequence, database="nr", program="blastp", format_type="Text"):
result_handle = NCBIWWW.qblast(program, database, sequence, format_type=format_type)
return result_handle.read()
# Main Workflow
def main():
# Define the path to your FASTA file and gene IDs
fasta_file = "holin_sequences.fasta"
gene_ids = [
"CAI77377.1",
"AAT90758.1",
"WP_004875774.1",
"WP_050116509.1",
"AAM83780.1",
"ABX88260.1",
"CAH22792.1",
"CAE13263.1",
"AAG55972.1",
"WP_047345313.1",
"WP_047964150.1",
"AAQ61208.1"
]
# Loop through each gene ID
for gene_id in gene_ids:
sequence = retrieve_sequence(fasta_file, gene_id)
if sequence:
print(f"Sequence for {gene_id}: {sequence}")
# Perform BLAST analysis
blast_result = blast_sequence(sequence)
# Save BLAST results to a file
with open(f"{gene_id}_blast_results.txt", "w") as result_file:
result_file.write(blast_result)
print(f"BLAST results for {gene_id} saved.")
else:
print(f"Sequence for {gene_id} not found in {fasta_file}")
if __name__ == "__main__":
main()
# Output file for consolidated results
fasta_file = "holin_sequences.fasta"
gene_ids = [
"CAI77377.1",
"AAT90758.1",
"WP_004875774.1",
"WP_050116509.1",
"AAM83780.1",
"ABX88260.1",
"CAH22792.1",
"CAE13263.1",
"AAG55972.1",
"WP_047345313.1",
"WP_047964150.1",
"AAQ61208.1"
]
with open("consolidated_blast_results.txt", "w") as consolidated_file:
# Loop through each gene ID
for gene_id in gene_ids:
sequence = retrieve_sequence(fasta_file, gene_id)
if sequence:
# Write sequence to the consolidated file
consolidated_file.write(f"Sequence for {gene_id}:\n{sequence}\n")
# Perform BLAST analysis
blast_result = blast_sequence(sequence)
# Save BLAST results to individual files
with open(f"{gene_id}_blast_results.txt", "w") as result_file:
result_file.write(blast_result)
# Write BLAST results to the consolidated file
consolidated_file.write(f"BLAST results for {gene_id}:\n{blast_result}\n")
print(f"BLAST results for {gene_id} saved.")
else:
print(f"Sequence for {gene_id} not found in {fasta_file}")
# 4.0 IMPROMPTU LITERATURE REVIEW
#The analysis identified the presence of homologous genes to phage holin family proteins, notably through proteins
#“WP_005163187.1” and “MBO0362545.1”. These proteins had high alignment scores and low E-values in the BLAST results,
#meaning that they closely matched the genes of interest in this new strain of Yersinia. This visionary evidence also
#indicates that these phage holin-related genes have been incorporated into the genome of the new strain, which might bear
#their influence on the biological characteristics of this new strain such as its pathogenicity.
Analysis of BLAST Results for NewYersinia Strains: Implications for Pathogenicity
NewYersinia1:
Analysis of the BLAST results for NewYersinia1 (Figure 3) reveals a high degree of sequence similarity between the query sequence and all matched sequences. This is evidenced by:
High identity: All matched sequences exhibit above 80% identity, ranging from 81.513% to 93.421%. This indicates a strong conservation of amino acid residues, suggesting a close evolutionary relationship and potentially shared functional properties. Significant e-values: The e-values of all matches are below 1.41E-48, indicating a very low probability of finding an equally good alignment by chance. This further strengthens the confidence in the identified sequences as true homologs. High bit scores: Bit scores exceeding 145 in all matches suggest a statistically significant and robust alignment, further supporting the close relationship between the query and subject sequences. Based on these parameters, the BLAST results for NewYersinia1 strongly suggest that the analyzed strain possesses a high degree of similarity to known pathogenic Yersinia strains. This finding warrants further investigation into the potential virulence factors and pathogenicity determinants present in NewYersinia1.
![BLAST RESULTS IMAGE] (https://github.com/2502495/Project-2/blob/f040eb181a3f06670f49d260b2ea001cb27a58a3/NY1_BLAST%20result.png)
Figure 2. NewYersinia2: W22703 Yersinia1 BLAAST Results
The BLAST results for NewYersinia2 (Figure 4) present a contrasting picture. Only three matched sequences were identified, corresponding to Holin, Endolysin, and i-spanin proteins. Notably:
Partial matches: The identities of these matched sequences range from 54.622% to 78.195%, indicating only partial similarity to the query sequences. This suggests potential divergence or functional modifications within these proteins compared to known Yersinia strains. Significant e-values: Similar to NewYersinia1, the e-values of all matches are below 8.21E-41, suggesting a low probability of chance occurrence. This reinforces the confidence in the identified sequences as relevant homologs, despite their partial similarity. High bit scores: Bit scores exceeding 128 for all matches indicate statistically significant alignments, further supporting the identified matches as true homologs. The absence of a matched sequence resembling o-spanin in the NewYersinia2 protein database is a noteworthy observation. O-spanin plays a crucial role in the Holin-Endolysin system, facilitating cell wall lysis during bacteriophage infection. The lack of a homologous sequence suggests a potential dysfunction or modification within this critical system, potentially impacting the ability of NewYersinia2 to effectively lyse host cells.
![BLAST RESULTS IMAGE] (https://github.com/2502495/Project-2/blob/f040eb181a3f06670f49d260b2ea001cb27a58a3/NY2_BLAST%20result.png)
Figure 3. W22703 Yersinia2 BLAST Results
The contrasting BLAST results for NewYersinia1 and NewYersinia2 suggest distinct evolutionary trajectories and potential differences in their pathogenic potential. NewYersinia1 exhibits high sequence similarity to known pathogenic strains, suggesting a strong likelihood of virulence. Conversely, NewYersinia2 displays partial matches for key proteins involved in bacteriophage infection, along with a notable absence of o-spanin, hinting at a potential dysfunction in the Holin-Endolysin system and potentially reduced virulence capabilities. Further investigation is warranted to elucidate the specific virulence factors and host-pathogen interactions associated with each strain.
The analysis identified the presence of homologous genes to phage holin family proteins, notably through proteins “WP_005163187.1” and “MBO0362545.1”. These proteins had high alignment scores and low E-values in the BLAST results, meaning that they closely matched the genes of interest in this new strain of Yersinia. This visionary evidence also indicates that these phage holin-related genes have been incorporated into the genome of the new strain, which might bear their influence on the biological characteristics of this new strain such as its pathogenicity. One of the notable findings from BLAST results was the high sequence conservation of these phage holin-related genes. Several sequences from different Yersinia species, relative to the query protein “CAI77377.1” show a high level of similarity. This statement emphasizes a significant evolutionary conservation, implying that these genes possess vital biological functions and are presumably preserved because they are important to the survival, and pathogenicity of Yersinia.
#5.0 FURTHER VISUALISATION AND ANALYSIS
#4. To understand the different outputs from BLAST and how to analyse them
#In the situation where the research group tries to fathom the different results created by BLAST (Basic Local Alignment
#Search Apparatus) and foster a comprehension of how to break down them, a precise methodology is framed.
#1.1 Perform BLAST Analysis
#To start the cycle, the group chooses the DNA or protein successions of interest, explicitly those beginning from the
#four genes inside the lysis cassette in Tc-PAIY, to act as queries in the BLAST analysis. The decision of the BLAST
#apparatus is dependent upon the idea of the arrangements, whether they are nucleotide or protein, and sticks to explicit
#database necessities.
#1.2 Examine BLAST Output
#Upon execution, the BLAST device produces a result document outfitting alignments between the question arrangements and
#those present in the chose database. The resultant data normally incorporates a thorough rundown of matches, including
#alignment scores, E-values meaning statistical importance, and other relevant data fundamental for additional analysis.
#1.3 Interpret Alignment Scores
#In this situation, the research group digs into the assessment of alignment scores. Higher scores are proposed to
#connote more grounded likenesses between the inquiry successions and the hits inside the database, giving a
#quantitative proportion of alignment quality.
#1.4 Evaluate E-values
#The statistical meaning of the alignments is surveyed through E-values, addressing the expected number of irregular
#hits having a comparative or unrivaled score. Lower E-values are focused on, demonstrating higher trust in the noticed
#similitudes.
#1.5 Assess Sequence Identities and Similarities
#The group examines the level of character and similitude inside the BLAST results. These measurements explain the level of
#resemblance between the question successions and the matched database groupings, assuming a significant part in grasping
#grouping preservation.
#Visualize Alignments
#To upgrade understanding, the group might decide to utilize perception tools or programming, working with the graphical
#portrayal of alignments. This perception supports clarifying explicit locales of comparability and disparity inside the
#successions.
#Filter and Refine Results
#Contingent upon the targets of the research, the group takes part in the filtration and refinement of BLAST results in
#light of foreordained standards. Boundaries like alignment length, percent character, or E-esteem limits are potentially
#utilized to distil the most appropriate data.
#Compare Multiple Hits
#In getting different hits, the research group embraces a near analysis. Different boundaries are considered to focus on
#and select the most pertinent matches among the plenty of results.
#Document and Report Findings
#At long last, the group participates in fastidious documentation and announcing of their discoveries. This complete record
#incorporates chosen hits, nitty gritty alignments, and relevant measurements, guaranteeing straightforwardness and giving
#a primary asset to future reference and analysis.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 4.1: Pathogenicity islands\n",
"\n",
"This exercise was inspired by [Libeskind-Hadas and Bush, *Computing for Biologists*, Cambridge University Press, 2014](https://www.cs.hmc.edu/CFB).\n",
"\n",
"\<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this and the [next problem](exercise_4.2.ipynb), we will work with real data from the *Yersinia Enterocolitica* genome. The section of the genome we will work with is in the file `~git/bootcamp/data/Yersinia_spi1_region.fna`. I cut it out of the [full genome](http://www.ncbi.nlm.nih.gov/nucleotide/821161554). It contains *Yersinia* pathogenicity island I (SPI1), which contains genes for surface receptors for host-pathogen interactions.\n",
"\n",
"Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the *Yersinia* genome by computing GC content.\n",
"\n",
"**a)** Use principles of TDD to write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like\n",
"\n",
" gc_blocks(seq, block_size)\n",
" \n",
"To be clear, if `seq = 'ATGACTACGT'` and `block_size = 4`, the blocks to be considered are\n",
"\n",
" ATGA\n",
" CTAC\n",
" \n",
"and the function should return `(0.25, 0.5)`. Note that the blocks are non-overlapping and that we don't bother with the fact that end of the sequence that does not fit completely in a block."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Write a function that takes as input a sequence, block size, and a threshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:\n",
"\n",
" mapped_seq = gc_map(seq, block_size, gc_thresh)\n",
"\n",
"For example, \n",
"\n",
" gc_map('ATGACTACGT', 4, 0.4)\n",
"\n",
"returns `'atgaCTAC'`. Note that bases not included in GC blocks (in this case the `GT` at the end of the sequence) are not included in the output sequence. Again, use principles of TDD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**c)** Use the `gc_map()` function to generate a GC content map for the *Yersinia* sequence with `block_size = 1000` and `gc_thresh = 0.45`. Where do you think the pathogenicity island is?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**d)** Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a `>` in the original FASTA file), and have line breaks every 60 characters in the sequence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\<br />"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** First let's write our tests. In writing the tests, I am making the design decision that I will count the characters `G`, `g`, `C`, and `c` as contributing to GC content, and that I will not check to make sure the sequence is valid. I also make the design decision that an empty sequence has zero GC content."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def test_gc_content():\n",
" assert gc_content('') == 0.0\n",
" assert gc_content('G') == 1.0\n",
" assert gc_content('g') == 1.0\n",
" assert gc_content('C') == 1.0\n",
" assert gc_content('c') == 1.0\n",
" assert gc_content('gcgcgc') == 1.0\n",
" assert gc_content('aaatatata') == 0.0\n",
" assert np.isclose(gc_content('ggatcggcga'), 0.7)\n",
" assert np.isclose(gc_content('attgggggcaatta'), 3/7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function is fairly simple. We loop through the sequence with a stride equal to the block size, computing the GC content for each subsequence of that length. We start with a function to compute GC content for a sequence."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def gc_content(seq):\n",
" \"\"\"GC content of a given sequence\"\"\"\n",
" if seq == '':\n",
" return 0.0\n",
" \n",
" seq = seq.upper()\n",
" return (seq.count('G') + seq.count('C')) / len(seq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's test it."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"test_gc_content()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passage! Next, we write the looping function, starting with its tests."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def test_gc_blocks():\n",
" assert gc_blocks('', 10) == tuple()\n",
" assert gc_blocks('gcgcgcgcg', 10) == tuple()\n",
" assert gc_blocks('gcgcgcg', 4) == (1.0,)\n",
" assert gc_blocks('gcgcgcgc', 4) == (1.0, 1.0)\n",
" assert gc_blocks('gcgcgcgcat', 4) == (1.0, 1.0)\n",
"\n",
" test_tuple = gc_blocks('gcgagcgcat', 4)\n",
" assert np.isclose(test_tuple[0], 0.75) and test_tuple[1] == 1.0\n",
" \n",
" assert gc_blocks('gcgtagagc', 1) == (1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0)\n",
" assert gc_blocks('gcgtagagc', 2) == (1.0, 0.5, 0.5, 0.5)\n",
" assert np.isclose(gc_blocks('gcgtagagc', 3), (1.0, 1/3, 2/3)).sum() == 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's write our function."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def gc_blocks(seq, block_size):\n",
" \"\"\"\n",
" Divide sequence into non-overlapping blocks\n",
" and compute GC content of each block.\n",
" \"\"\"\n",
" blocks = []\n",
" for i in range(0, len(seq) - (len(seq) % block_size), block_size):\n",
" blocks.append(gc_content(seq[i:i+block_size]))\n",
" return tuple(blocks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the tests...."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"test_gc_blocks()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Success! Let's take this function for a spin, looking at 1000-base blocks. We will use the FASTA reader function from a previous exercise to read in the _Yersinia_ genome fragment."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.521,\n",
" 0.556,\n",
" 0.54,\n",
" 0.498,\n",
" 0.551,\n",
" 0.508,\n",
" 0.563,\n",
" 0.484,\n",
" 0.58,\n",
" 0.557,\n",
" 0.523,\n",
" 0.524,\n",
" 0.621,\n",
" 0.556,\n",
" 0.481,\n",
" 0.57,\n",
" 0.581,\n",
" 0.614,\n",
" 0.603,\n",
" 0.526,\n",
" 0.524,\n",
" 0.591,\n",
" 0.563,\n",
" 0.596,\n",
" 0.563,\n",
" 0.6,\n",
" 0.613,\n",
" 0.594,\n",
" 0.486,\n",
" 0.554,\n",
" 0.566,\n",
" 0.592,\n",
" 0.563,\n",
" 0.537,\n",
" 0.575,\n",
" 0.501,\n",
" 0.54,\n",
" 0.555,\n",
" 0.487,\n",
" 0.416,\n",
" 0.423,\n",
" 0.371,\n",
" 0.394,\n",
" 0.48,\n",
" 0.454,\n",
" 0.474,\n",
" 0.434,\n",
" 0.396,\n",
" 0.37,\n",
" 0.456,\n",
" 0.409,\n",
" 0.457,\n",
" 0.4,\n",
" 0.405,\n",
" 0.475,\n",
" 0.47,\n",
" 0.479,\n",
" 0.494,\n",
" 0.497,\n",
" 0.516,\n",
" 0.444,\n",
" 0.433,\n",
" 0.471,\n",
" 0.458,\n",
" 0.53,\n",
" 0.458,\n",
" 0.56,\n",
" 0.427,\n",
" 0.47,\n",
" 0.438,\n",
" 0.465,\n",
" 0.473,\n",
" 0.46,\n",
" 0.399,\n",
" 0.426,\n",
" 0.359,\n",
" 0.469,\n",
" 0.433,\n",
" 0.425,\n",
" 0.504,\n",
" 0.578,\n",
" 0.576,\n",
" 0.553,\n",
" 0.531,\n",
" 0.57,\n",
" 0.599,\n",
" 0.562,\n",
" 0.555,\n",
" 0.595,\n",
" 0.586,\n",
" 0.55,\n",
" 0.56,\n",
" 0.545,\n",
" 0.553,\n",
" 0.537,\n",
" 0.519,\n",
" 0.519,\n",
" 0.567,\n",
" 0.551,\n",
" 0.548,\n",
" 0.559,\n",
" 0.527,\n",
" 0.559,\n",
" 0.529,\n",
" 0.49,\n",
" 0.533,\n",
" 0.58,\n",
" 0.545,\n",
" 0.558,\n",
" 0.575,\n",
" 0.555,\n",
" 0.49,\n",
" 0.567,\n",
" 0.515,\n",
" 0.518,\n",
" 0.485,\n",
" 0.38,\n",
" 0.461,\n",
" 0.568,\n",
" 0.575,\n",
" 0.567,\n",
" 0.57,\n",
" 0.472,\n",
" 0.513,\n",
" 0.582,\n",
" 0.476,\n",
" 0.505,\n",
" 0.524,\n",
" 0.51,\n",
" 0.512,\n",
" 0.391,\n",
" 0.463,\n",
" 0.57,\n",
" 0.546,\n",
" 0.535,\n",
" 0.525,\n",
" 0.525,\n",
" 0.529,\n",
" 0.58,\n",
" 0.555,\n",
" 0.558,\n",
" 0.563,\n",
" 0.525,\n",
" 0.505,\n",
" 0.557,\n",
" 0.554,\n",
" 0.484,\n",
" 0.525,\n",
" 0.567,\n",
" 0.467,\n",
" 0.527,\n",
" 0.55,\n",
" 0.577,\n",
" 0.554,\n",
" 0.538,\n",
" 0.429,\n",
" 0.507,\n",
" 0.557,\n",
" 0.592,\n",
" 0.595,\n",
" 0.554,\n",
" 0.521,\n",
" 0.539,\n",
" 0.521,\n",
" 0.45,\n",
" 0.608,\n",
" 0.489,\n",
" 0.477,\n",
" 0.552,\n",
" 0.508,\n",
" 0.544,\n",
" 0.495,\n",
" 0.543,\n",
" 0.56,\n",
" 0.596,\n",
" 0.547,\n",
" 0.581,\n",
" 0.548,\n",
" 0.537,\n",
" 0.529,\n",
" 0.513,\n",
" 0.499,\n",
" 0.545,\n",
" 0.567,\n",
" 0.52,\n",
" 0.545,\n",
" 0.548,\n",
" 0.522,\n",
" 0.533,\n",
" 0.558,\n",
" 0.586,\n",
" 0.469,\n",
" 0.516,\n",
" 0.509,\n",
" 0.511,\n",
" 0.569,\n",
" 0.575,\n",
" 0.559,\n",
" 0.545,\n",
" 0.502)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def read_fasta(filename):\n",
" \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n",
" \n",
" We assume that the first line of the file is the descriptor and all\n",
" subsequent lines are sequence. \n",
" \"\"\"\n",
" with open(filename, 'r') as f:\n",
" # Read in descriptor\n",
" descriptor = f.readline().rstrip()\n",
"\n",
" # Read in sequence, stripping the whitespace from each line\n",
" seq = ''\n",
" line = f.readline().rstrip()\n",
" while line != '':\n",
" seq += line\n",
" line = f.readline().rstrip()\n",
" \n",
" return descriptor, seq\n",
"\n",
"\n",
"descriptor, seq = read_fasta('data/Yersinia_spi1_region.fna')\n",
"\n",
"gc_blocks(seq, 1000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get a tuple of GC content, which is hard to look at on screen, but this is useful for plotting GC content over the course of a sequence. We will learn how to plot later in the bootcamp."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** We just use our already-written `gc_content()` function to decide how to modify the string of the sequence. First, the tests. We make the design decision that we will truncate the sequence if the 3'-most end is shorter than the block length."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def test_gc_map():\n",
" assert gc_map('', 10, 0.5) == ''\n",
" assert gc_map('ATATATATA', 4, 0.5) == 'atatatat'\n",
" assert gc_map('GCGCGCGCG', 4, 0.5) == 'GCGCGCGC'\n",
" assert gc_map('GATCGATCC', 4, 0.5) == 'GATCGATC'\n",
" assert gc_map('GATCGATCC', 4, 0.51) == 'gatcgatc'\n",
" assert gc_map('GATCGATCC', 3, 0.5) == 'gatCGATCC'\n",
" assert gc_map('GATCGATCC', 3, 0.75) == 'gatcgatcc'\n",
" assert gc_map('GATCGATCC', 3, 0.25) == 'GATCGATCC'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the function...."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def gc_map(seq, block_size, gc_thresh):\n",
" \"\"\"Give back seq with lowercase letters where GC content is low.\"\"\" \n",
" out_seq = ''\n",
"\n",
" # Determine GC content of each block and change string accordingly\n",
" for i in range(0, len(seq) - (len(seq) % block_size), block_size):\n",
" if gc_content(seq[i:i+block_size]) \< gc_thresh:\n",
" out_seq += seq[i:i+block_size].lower()\n",
" else:\n",
" out_seq += seq[i:i+block_size].upper()\n",
"\n",
" return out_seq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the tests."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"test_gc_map()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passage! We can now use these functions to analyze sequences of interest."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**c)** Let's do it for *Yersinia*!"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"sal_gcmap = gc_map(seq, 1000, 0.45)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To save on display space, we will not display the sequence here. Scrolling through the GC map file generated in the next part, the pathogenicity island appears to occur about a quarter of the way into this subsequence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**d)** To write the file out, we use the fact that we conveniently kept the description text when we parsed the *Yersinia* FASTA file in the first place. We then just write the `sal_gcmap` string in blocks of 60. We have to make sure to get the last few bases as well."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Write the result\n",
"with open('Yersinia_spi1_region_gc_map.fna', 'w') as f:\n",
" # Write description text\n",
" f.write(descriptor + '\\n')\n",
"\n",
" # Write sequence in blocks of 60\n",
" i = 0\n",
" while i \< len(sal_gcmap) - 59:\n",
" f.write(sal_gcmap[i:i+60] + '\\n')\n",
" i += 60\n",
" \n",
" # Write last line\n",
" f.write(sal_gcmap[i:] + '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll take a quick look to see it worked out ok."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
">gi|821161554|gb|CP011428.1| Yersinia Enterocolitica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000\n",
"AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTC\n",
"ACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCAC\n",
"CCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGT\n",
"CAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGG\n",
"CATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGT\n",
"CATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTT\n",
"TATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGC\n",
"TTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAG\n",
"AGCTGCCGCATCCGTACCACCGGCTTGCGCCATGTCCGGACGACCGCCACCCTTACCGCC\n",
"...\n",
"ACGCATTTCTCCCGTGCAGGTCACATTTGCCCGACACGGCGGGGCAAGAGGCTTGAACAG\n",
"ACGTTCATTTTCCGTAAAACTGGCGTAATGTAAGCGTTTACCCACTATAGGTATTATCAT\n",
"GGCGACCATAAAAGATGTAGCCCGACTGGCCGGTGTTTCAGTCGCCACCGTTTCTCGCGT\n",
"TATTAACGATTCGCCAAAAGCCAGCGAAGCGTCCCGGCTGGCGGTAACCAGCGCAATGGA\n",
"GTCCCTGAGCTATCACCCTAACGCCAACGCGCGCGCGCTGGCACAGCAGGCAACGGAAAC\n",
"CCTCGGTCTGGTGGTCGGCGACGTTTCCGATCCTTTTTTCGGCGCGATGGTGAAAGCCGT\n",
"TGAACAGGTGGCGTATCACACCGGCAATTTTTTACTGATTGGCAACGGGTATCATAACGA\n",
"ACAAAAAGAGCGTCAGGCTATTGAACAGTTGATTCGTCATCGTTGCGCAGCGTTAGTGGT\n",
"GCACGCCAAAATGATTCCGGATGCGGACCTGGCCTCATTAATGAAGCAAATCCCCGGCAT\n",
"GGTGCTGATTAACCGCATTT\n"
]
}
],
"source": [
"!head Yersinia_spi1_region_gc_map.fna\n",
"print('...')\n",
"!tail Yersinia_spi1_region_gc_map.fna"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks good!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing environment"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPython 3.7.7\n",
"IPython 7.13.0\n",
"\n",
"numpy 1.18.1\n",
"jupyterlab 1.2.6\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -v -p numpy,jupyterlab"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
![Holin] (https://github.com/2502495/Project-2/blob/5513b8cd6cd593ad4906945700b61b1d49b907ea/Holin.png) Figure 4. Holin SWISS Analysis
! [Endolysin] (https://github.com/2502495/Project-2/blob/5513b8cd6cd593ad4906945700b61b1d49b907ea/Endolysin.png)
Figure 5. Endolysin SWISS Analysis
![i-Spanin] (https://github.com/2502495/Project-2/blob/5513b8cd6cd593ad4906945700b61b1d49b907ea/i-spanin.png)
Figure 6. i-Spanin SWISS Analysis
![o-Spanin] (https://github.com/2502495/Project-2/blob/5513b8cd6cd593ad4906945700b61b1d49b907ea/o-spanin.png)
Figure 7. o-Spanin SWISS Analysis
The BLAST analysis gives an in-depth look into the genetic architecture of the new Yersinia strain, especially concerning its phage holin-linked genes. Since the conservation and prevalence of these genes imply an important aspect in the bacterium’s life cycle, pathogenic mechanisms may be influenced by them. Learning how these conserved genes function and interact can provide insights into Yersinia’s techniques for survival, virulence, and adaptation
The BLAST analysis gives an in-depth look into the genetic architecture of the new Yersinia strain, especially concerning its phage holin-linked genes. Since the conservation and prevalence of these genes imply an important aspect in the bacterium’s life cycle, pathogenic mechanisms may be influenced by them. Learning how these conserved genes function and interact can provide insights into Yersinia’s techniques for survival, virulence, and adaptation
The genetic diversity shown in the genomic comparison with different Yersinia species emphasizes the complex evolutionary processes within the genus. It allows for further research opportunities regarding the role these conserved genes play in enhancing the adaptability and pathogenicity of Yersinia, thus understanding bacterial evolution, and host-pathogen interactions better.
Overall, the study using BLAST analysis of the novel Yersinia strain detected phage holin-related genes in this bacterium; these latter possess a high degree of conservation across all Yersinia species and propose possible functional roles. These observations contribute greatly to the knowledge of the genetic heterogeneity and virulence features of Yersinia, laying a strong basis for future genomic and functional studies in microbiology
Springer K, Reuter S, Knüpfer M, Schmauder L, Sänger PA, Felsl A, Fuchs TM. Activity of a Holin-Endolysin System in the Insecticidal Pathogenicity Island of Yersinia enterocolitica. J Bacteriol. 2018 Jul 25;200(16):e00180-18. doi: 10.1128/JB.00180-18. PMID: 29866807; PMCID: PMC6060350.